Abstract

This paper presents several novel generalization bounds for the problem of learning kernels based
on the analysis of the Rademacher complexity of the corresponding hypothesis sets. Our bound for
learning kernels with a convex combination of $p$ base kernels has only a $\log p$ dependency on the
number of kernels, $p$, which is considerably more favorable than the previous best bound given for
the same problem. We also give a novel bound for learning with a linear combination of $p$ base
kernels with an $L_2$ regularization whose dependency on $p$ is only in $p^{1/4}$.

1 Introduction 

Kernel methods are widely used in statistical learning [17, 18]. Positive definite symmetric (PDS) kernels
specify an inner product in an implicit Hilbert space where large-margin methods are used for learning and
estimation. They can be combined with algorithms such as support vector machines (SVMs) [5, 10, 20] or
other kernel-based algorithms to form powerful learning techniques.

However, the choice of the kernel, which is critical to the success of the algorithm, is typically left to the
user. Rather than requesting the user to commit to a specific kernel, which may not be optimal for the task,
especially if the user's prior knowledge about the task is poor, learning kernel methods require the user only to
specify a family of kernels. The learning algorithm then selects both the specific kernel out of that family
and the hypothesis defined with respect to that kernel.

There is a large body of literature dealing with various aspects of the problem of learning kernels, 
including theoretical questions, optimization problems related to this problem, and experimental results 
JO] [T31 12 [JJ [19] HU [IH E3l HI] El [SJ ESI m. Some of this previous work considers families of Gaussian 
kernels [15] or hyperkernels [16]. Non-linear combinations of kernels have been recently considered by
[21, 3, 9]. But the most common family of kernels examined is that of non-negative combinations of some
fixed kernels constrained by a trace condition, which can be viewed as an $L_1$ regularization [13], or by an
$L_2$ regularization [8].

This paper presents several novel generalization bounds for the problem of learning kernels for the
family of convex combinations of base kernels or linear combinations with an $L_2$ constraint. One of the
first learning bounds given by Lanckriet et al. [13] for the family of convex combinations of $p$ base kernels
is similar to that of Bousquet and Herrmann [6] and has the following form:

$$R(h) \le \widehat{R}_\rho(h) + O\Big(\frac{1}{\sqrt{m}} \sqrt{\max_{k=1}^{p} \mathrm{Tr}(\mathbf{K}_k) \, \max_{k=1}^{p} \big(\|\mathbf{K}_k\| / \mathrm{Tr}(\mathbf{K}_k)\big) / \rho^2}\Big),$$

where $R(h)$ is the generalization error of a hypothesis $h$, $\widehat{R}_\rho(h)$ is the fraction of training points with margin less than or equal to $\rho$, and $\mathbf{K}_k$ is the kernel matrix
associated to the $k$th base kernel. This bound was later shown by Srebro and Ben-David [19] to be always
larger than one. Another bound by Lanckriet et al. [13] for the family of linear combinations of base kernels
was also shown by the same authors to be always larger than one.

But Lanckriet et al. [13] also presented a multiplicative bound for convex combinations of base kernels
that is of the form $R(h) \le \widehat{R}_\rho(h) + O\big(\sqrt{(p/\rho^2)/m}\big)$. This bound converges and can perhaps be viewed as
the first informative generalization bound for this family of kernels. However, the dependence of the bound
on the number of kernels $p$ is multiplicative, which therefore does not encourage the use of too many base
kernels. Srebro and Ben-David [19] presented a generalization bound based on the pseudo-dimension of
the family of kernels that significantly improved on this bound. Their bound has the form
$R(h) \le \widehat{R}_\rho(h) + \widetilde{O}\big(\sqrt{(p + R^2/\rho^2)/m}\big)$, where the notation $\widetilde{O}(\cdot)$ hides logarithmic terms and where $R^2$ is an upper bound on $K_k(x, x)$
for all points $x$ and base kernels $K_k$, $k \in [1, p]$. Thus, disregarding logarithmic terms, their bound is only
additive in $p$. Their analysis also applies to other families of kernels. Ying and Campbell [22] also give
generalization bounds for learning kernels based on the notion of Rademacher chaos complexity and the 
pseudo-dimension of the family of kernels used. It is not clear however how their bound compares to that 
of Srebro and Ben-David. We present new generalization bounds for the family of convex combinations 
of base kernels that have only a logarithmic dependency on p. Our learning bound is based on a careful 
analysis of the Rademacher complexity of the hypothesis set considered and has the form:
$R(h) \le \widehat{R}_\rho(h) + O\big(\sqrt{(\log p) R^2/\rho^2 / m}\big)$. Our bound is simpler and contains no other extra logarithmic term. Thus, this represents
a substantial improvement over the previous best bounds for this problem. Our bound is also valid for a very
large number of kernels, in particular for $p \gg m$, while the previous bounds were not informative in that
case.

We also present new generalization bounds for the family of linear combinations of base kernels with
an $L_2$ regularization. We had previously given a stability bound for an algorithm extending kernel ridge
regression to learning kernels that had an additive dependency with respect to $p$ [8], assuming a technical
condition of orthogonality on the base kernels. The complexity term of that bound was of the form $O(1/\sqrt{m} + \sqrt{p/m})$.
Our new learning bound admits only a mild dependency of $p^{1/4}$ on the number of base kernels.

The next section (Section 2) defines the family of kernels and hypothesis sets we examine. Section 3
presents a bound on the Rademacher complexity of the class of convex combinations of base kernels with an
$L_1$ constraint and a generalization bound for binary classification directly derived from that result. Similarly,
Section 4 presents first a bound on the Rademacher complexity, then a generalization bound for the case of
an $L_2$ regularization.

2 Preliminaries 

Most learning kernel algorithms are based on a hypothesis set derived from convex combinations of a fixed
set of kernels $K_1, \ldots, K_p$:

$$H_p = \Big\{ \sum_{i=1}^{m} \alpha_i K(x_i, \cdot) : K = \sum_{k=1}^{p} \mu_k K_k, \ \mu_k \ge 0, \ \sum_{k=1}^{p} \mu_k = 1, \ \boldsymbol{\alpha}^\top \mathbf{K} \boldsymbol{\alpha} \le 1/\rho^2 \Big\}. \tag{1}$$

Note that linear combinations with possibly negative mixture weights have also been considered in the literature,
e.g., [13]; however, these combinations do not ensure that the combined kernel is PDS.

We also consider the hypothesis set $H'_p$ based on an $L_2$ condition on the vector $\boldsymbol{\mu}$ and defined as follows:

$$H'_p = \Big\{ \sum_{i=1}^{m} \alpha_i K(x_i, \cdot) : K = \sum_{k=1}^{p} \mu_k K_k, \ \mu_k \ge 0, \ \sum_{k=1}^{p} \mu_k^2 = 1, \ \boldsymbol{\alpha}^\top \mathbf{K} \boldsymbol{\alpha} \le 1/\rho^2 \Big\}. \tag{2}$$
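As a concrete illustration of the constraints defining $H_p$, the following sketch builds a convex combination of base Gram matrices and rescales a coefficient vector onto the boundary $\boldsymbol{\alpha}^\top \mathbf{K} \boldsymbol{\alpha} = 1/\rho^2$. The Gaussian base kernels, the bandwidths, the weights, and the helper names `gaussian_gram` and `combine` are illustrative assumptions, not choices made in the paper.

```python
import numpy as np

def gaussian_gram(X, gamma):
    """Gram matrix of the Gaussian kernel exp(-gamma * ||x - x'||^2)."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def combine(grams, mu):
    """Return K = sum_k mu_k K_k for a list of Gram matrices."""
    return sum(w * K for w, K in zip(mu, grams))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                              # m = 20 sample points
grams = [gaussian_gram(X, g) for g in (0.1, 1.0, 10.0)]   # p = 3 base kernels

mu = np.array([0.2, 0.5, 0.3])     # convex weights: mu_k >= 0 and sum_k mu_k = 1
K = combine(grams, mu)

rho = 0.5
alpha = rng.normal(size=20)
# Rescale alpha so that alpha^T K alpha = 1 / rho^2 (boundary of the constraint).
alpha *= 1.0 / (rho * np.sqrt(alpha @ K @ alpha))

min_eig = np.linalg.eigvalsh(K).min()   # non-negative up to round-off for a PDS kernel
quad = alpha @ K @ alpha                # equals 1 / rho^2 up to round-off
```

Because the weights are non-negative, the combined matrix stays positive semidefinite, which is the property that may fail for combinations with negative weights.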



We bound the empirical Rademacher complexity $\widehat{\mathfrak{R}}_S(H_p)$ or $\widehat{\mathfrak{R}}_S(H'_p)$ of these families for an arbitrary sample
$S$ of size $m$, which immediately yields a generalization bound for learning kernels based on this family
of hypotheses. For a fixed sample $S = (x_1, \ldots, x_m)$, the empirical Rademacher complexity of a hypothesis
set $H$ is defined as

$$\widehat{\mathfrak{R}}_S(H) = \frac{1}{m} \mathop{\mathrm{E}}_{\boldsymbol{\sigma}} \Big[ \sup_{h \in H} \sum_{i=1}^{m} \sigma_i h(x_i) \Big]. \tag{3}$$

The expectation is taken over $\boldsymbol{\sigma} = (\sigma_1, \ldots, \sigma_m)$ where the $\sigma_i$s are independent uniform random variables taking
values in $\{-1, +1\}$.
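To make definition (3) concrete, here is a minimal sketch that evaluates it exactly for a finite hypothesis set by enumerating all $2^m$ sign vectors. The function name `empirical_rademacher` and the toy class are illustrative assumptions; the hypothesis sets studied in this paper are infinite, so this only illustrates the definition.

```python
import itertools
import numpy as np

def empirical_rademacher(preds):
    """Exact empirical Rademacher complexity of a finite hypothesis set.

    preds has shape (n_hypotheses, m); row j holds (h_j(x_1), ..., h_j(x_m)).
    The expectation over sigma is computed by enumerating all 2^m sign
    vectors, so this is only practical for small m.
    """
    preds = np.asarray(preds, dtype=float)
    m = preds.shape[1]
    total = 0.0
    for signs in itertools.product((-1.0, 1.0), repeat=m):
        sigma = np.array(signs)
        total += np.max(preds @ sigma)    # sup over h of sum_i sigma_i h(x_i)
    return total / (2**m * m)

# Toy class {h, -h} with h equal to 1 on both sample points (m = 2).
value = empirical_rademacher([[1.0, 1.0], [-1.0, -1.0]])   # 0.5
```

For this toy class the supremum is $|\sigma_1 + \sigma_2|$, which takes the values 2, 0, 0, 2 over the four sign vectors, so the complexity is $(4/4)/2 = 0.5$.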



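Let $h \in H_p$, then

$$h(x) = \sum_{i=1}^{m} \alpha_i K(x_i, x) = \sum_{k=1}^{p} \sum_{i=1}^{m} \mu_k \alpha_i K_k(x_i, x) = \mathbf{w} \cdot \boldsymbol{\Phi}(x), \tag{4}$$

where $\mathbf{w} = (\mathbf{w}_1, \ldots, \mathbf{w}_p)^\top$ and $\boldsymbol{\Phi}(x) = (\Phi_1(x), \ldots, \Phi_p(x))^\top$, with $\mathbf{w}_k = \mu_k \sum_{i=1}^{m} \alpha_i \Phi_k(x_i)$ and $\Phi_k(x) = K_k(x, \cdot)$, for all
$k \in [1, p]$.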
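The decomposition in (4) can be checked numerically when the feature maps are explicit. The sketch below assumes, purely for illustration, that each base kernel is a linear kernel on its own block of coordinates, so that $\Phi_k(x)$ is simply the $k$-th block of $x$; the dimensions and weights are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
m, p, d = 6, 3, 2                     # sample size, number of kernels, block dimension
X = rng.normal(size=(m, p, d))        # X[i, k] is the k-th feature block of x_i
x_new = rng.normal(size=(p, d))       # a test point, split into the same blocks

mu = np.array([0.5, 0.3, 0.2])        # convex combination weights
alpha = rng.normal(size=m)

# Kernel form: h(x) = sum_i alpha_i sum_k mu_k K_k(x_i, x), with
# K_k(x, x') = <x[k], x'[k]> (a linear kernel on block k, so Phi_k(x) = x[k]).
h_kernel = sum(
    alpha[i] * mu[k] * (X[i, k] @ x_new[k])
    for i in range(m) for k in range(p)
)

# Feature form: w_k = mu_k sum_i alpha_i Phi_k(x_i), h(x) = sum_k <w_k, Phi_k(x)>.
w = mu[:, None] * np.einsum("i,ikd->kd", alpha, X)
h_feature = float(np.sum(w * x_new))
```

Both expressions agree up to floating-point error, which is the identity $h(x) = \mathbf{w} \cdot \boldsymbol{\Phi}(x)$ used throughout the proofs.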

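3 Rademacher complexity bound for $H_p$

Theorem 1 For any sample $S$ of size $m$, the Rademacher complexity of the hypothesis set $H_p$ can be bounded
as follows:

$$\widehat{\mathfrak{R}}_S(H_p) \le \frac{\|\boldsymbol{\tau}\|_r}{m\rho} \quad \text{with } \boldsymbol{\tau} = \big(\sqrt{r\,\mathrm{Tr}[\mathbf{K}_1]}, \ldots, \sqrt{r\,\mathrm{Tr}[\mathbf{K}_p]}\big)^\top, \tag{5}$$

for any even integer $r > 0$. If additionally, $K_k(x, x) \le R^2$ for all $x \in X$ and $k \in [1, p]$, then, for $p > 1$,

$$\widehat{\mathfrak{R}}_S(H_p) \le \sqrt{\frac{2e \lceil \log p \rceil R^2/\rho^2}{m}}.$$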



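Proof: Fix a sample $S$. Then $\widehat{\mathfrak{R}}_S(H_p)$ can be bounded as follows for the hypothesis set of kernel learning
algorithms, for any $q, r > 1$ with $1/q + 1/r = 1$:

$$\begin{aligned}
\widehat{\mathfrak{R}}_S(H_p) &= \frac{1}{m} \mathop{\mathrm{E}}_{\boldsymbol{\sigma}} \Big[ \sup_{h \in H_p} \sum_{i=1}^{m} \sigma_i h(x_i) \Big] = \frac{1}{m} \mathop{\mathrm{E}}_{\boldsymbol{\sigma}} \Big[ \sup_{\mathbf{w}} \mathbf{w} \cdot \sum_{i=1}^{m} \sigma_i \boldsymbol{\Phi}(x_i) \Big] \\
&\le \frac{1}{m} \mathop{\mathrm{E}}_{\boldsymbol{\sigma}} \Big[ \sup_{\mathbf{w}} \Big( \sum_{k=1}^{p} \|\mathbf{w}_k\|^q \Big)^{1/q} \Big( \sum_{k=1}^{p} \Big\| \sum_{i=1}^{m} \sigma_i \Phi_k(x_i) \Big\|^r \Big)^{1/r} \Big] \quad \text{(Lemma 5)}.
\end{aligned}$$

We bound each of these two factors separately. The first term can be bounded as follows.

$$\begin{aligned}
\Big( \sum_{k=1}^{p} \|\mathbf{w}_k\|^q \Big)^{1/q} &\le \sum_{k=1}^{p} \big( \|\mathbf{w}_k\|^q \big)^{1/q} \quad \text{(sub-additivity of } x \mapsto x^{1/q}, \ 1/q \le 1) \\
&= \sum_{k=1}^{p} \mu_k \Big\| \sum_{i=1}^{m} \alpha_i \Phi_k(x_i) \Big\| \\
&\le \sqrt{ \sum_{k=1}^{p} \mu_k \Big\| \sum_{i=1}^{m} \alpha_i \Phi_k(x_i) \Big\|^2 } \quad \text{(convexity)} \\
&= \sqrt{ \sum_{k=1}^{p} \mu_k \boldsymbol{\alpha}^\top \mathbf{K}_k \boldsymbol{\alpha} } = \sqrt{ \boldsymbol{\alpha}^\top \mathbf{K} \boldsymbol{\alpha} } \le 1/\rho.
\end{aligned}$$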



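We bound the second term as follows:

$$\begin{aligned}
\mathop{\mathrm{E}}_{\boldsymbol{\sigma}} \Big[ \Big( \sum_{k=1}^{p} \Big\| \sum_{i=1}^{m} \sigma_i \Phi_k(x_i) \Big\|^r \Big)^{1/r} \Big] &\le \Big( \mathop{\mathrm{E}}_{\boldsymbol{\sigma}} \Big[ \sum_{k=1}^{p} \Big\| \sum_{i=1}^{m} \sigma_i \Phi_k(x_i) \Big\|^r \Big] \Big)^{1/r} \quad \text{(Jensen's inequality)} \\
&= \Big( \sum_{k=1}^{p} \mathop{\mathrm{E}}_{\boldsymbol{\sigma}} \Big[ \Big\| \sum_{i=1}^{m} \sigma_i \Phi_k(x_i) \Big\|^r \Big] \Big)^{1/r}.
\end{aligned}$$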

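Suppose that $r$ is an even integer, $r = 2r'$. Then, we can bound the expectation as follows:

$$\begin{aligned}
\mathop{\mathrm{E}}_{\boldsymbol{\sigma}} \Big[ \Big\| \sum_{i=1}^{m} \sigma_i \Phi_k(x_i) \Big\|^r \Big] &= \mathop{\mathrm{E}}_{\boldsymbol{\sigma}} \Big[ \Big( \sum_{i,j=1}^{m} \sigma_i \sigma_j K_k(x_i, x_j) \Big)^{r'} \Big] \\
&\le \sum_{\substack{1 \le i_1, \ldots, i_{r'} \le m \\ 1 \le j_1, \ldots, j_{r'} \le m}} \Big| \mathop{\mathrm{E}}_{\boldsymbol{\sigma}} \big[ \sigma_{i_1} \sigma_{j_1} \cdots \sigma_{i_{r'}} \sigma_{j_{r'}} \big] \Big| \, \big| K_k(x_{i_1}, x_{j_1}) \cdots K_k(x_{i_{r'}}, x_{j_{r'}}) \big| \\
&\le \sum_{\substack{1 \le i_1, \ldots, i_{r'} \le m \\ 1 \le j_1, \ldots, j_{r'} \le m}} \Big| \mathop{\mathrm{E}}_{\boldsymbol{\sigma}} \big[ \sigma_{i_1} \sigma_{j_1} \cdots \sigma_{i_{r'}} \sigma_{j_{r'}} \big] \Big| \, \big( K_k(x_{i_1}, x_{i_1}) \cdots K_k(x_{i_{r'}}, x_{i_{r'}}) \big)^{1/2} \big( K_k(x_{j_1}, x_{j_1}) \cdots K_k(x_{j_{r'}}, x_{j_{r'}}) \big)^{1/2} \quad \text{(Cauchy-Schwarz)} \\
&= \sum_{s_1 + \ldots + s_m = 2r'} \binom{2r'}{s_1, \ldots, s_m} \Big| \mathop{\mathrm{E}}_{\boldsymbol{\sigma}} \big[ \sigma_1^{s_1} \cdots \sigma_m^{s_m} \big] \Big| \, K_k(x_1, x_1)^{s_1/2} \cdots K_k(x_m, x_m)^{s_m/2}.
\end{aligned}$$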



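Since $\mathrm{E}[\sigma_i] = 0$ for all $i$ and since the Rademacher variables are independent, we can write $\mathrm{E}[\sigma_{i_1} \cdots \sigma_{i_l}] = \mathrm{E}[\sigma_{i_1}] \cdots \mathrm{E}[\sigma_{i_l}] = 0$
for any $l$ distinct variables $\sigma_{i_1}, \ldots, \sigma_{i_l}$. Thus, $\mathrm{E}_{\boldsymbol{\sigma}}[\sigma_1^{s_1} \cdots \sigma_m^{s_m}] = 0$ unless all $s_i$s are
even, in which case $\mathrm{E}_{\boldsymbol{\sigma}}[\sigma_1^{s_1} \cdots \sigma_m^{s_m}] = 1$. Therefore, the following inequality holds:$^1$

$$\begin{aligned}
\mathop{\mathrm{E}}_{\boldsymbol{\sigma}} \Big[ \Big\| \sum_{i=1}^{m} \sigma_i \Phi_k(x_i) \Big\|^r \Big] &\le \sum_{2t_1 + \ldots + 2t_m = 2r'} \binom{2r'}{2t_1, \ldots, 2t_m} K_k(x_1, x_1)^{t_1} \cdots K_k(x_m, x_m)^{t_m} \\
&\le (2r')^{r'} \sum_{t_1 + \ldots + t_m = r'} \binom{r'}{t_1, \ldots, t_m} K_k(x_1, x_1)^{t_1} \cdots K_k(x_m, x_m)^{t_m} \\
&= \big( 2r' \, \mathrm{Tr}[\mathbf{K}_k] \big)^{r'} = \big( r \, \mathrm{Tr}[\mathbf{K}_k] \big)^{r/2}.
\end{aligned}$$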
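The moment inequality above can be sanity-checked by brute force on a small Gram matrix, using $\|\sum_i \sigma_i \Phi_k(x_i)\|^2 = \boldsymbol{\sigma}^\top \mathbf{K}_k \boldsymbol{\sigma}$. The sketch below enumerates all sign vectors for $m = 8$; the random linear-kernel Gram matrix and the function name `rademacher_moment` are illustrative assumptions.

```python
import itertools
import numpy as np

def rademacher_moment(K, r):
    """Exact E_sigma[(sigma^T K sigma)^(r/2)] for even r, by full enumeration."""
    m = K.shape[0]
    vals = []
    for s in itertools.product((-1.0, 1.0), repeat=m):
        sigma = np.array(s)
        vals.append((sigma @ K @ sigma) ** (r // 2))
    return float(np.mean(vals))

rng = np.random.default_rng(2)
A = rng.normal(size=(8, 4))
K = A @ A.T                                  # an 8 x 8 PSD Gram matrix (linear kernel)

checks = []
for r in (2, 4, 6):
    lhs = rademacher_moment(K, r)            # E || sum_i sigma_i Phi(x_i) ||^r
    rhs = (r * np.trace(K)) ** (r // 2)      # (r Tr[K])^(r/2)
    checks.append((r, lhs, rhs))
```

For $r = 2$ the left-hand side equals $\mathrm{Tr}[\mathbf{K}]$ exactly, so the bound is loose by a factor of 2 there, in line with the remark that the multinomial coefficients are bounded roughly.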
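Thus, the Rademacher complexity is bounded by

$$\widehat{\mathfrak{R}}_S(H_p) \le \frac{\|\boldsymbol{\tau}\|_r}{m\rho} \quad \text{with } \boldsymbol{\tau} = \big(\sqrt{r\,\mathrm{Tr}[\mathbf{K}_1]}, \ldots, \sqrt{r\,\mathrm{Tr}[\mathbf{K}_p]}\big)^\top, \tag{6}$$

for any even integer $r$.

Assume that $K_k(x, x) \le R^2$ for all $x \in X$ and $k \in [1, p]$. Then, $\mathrm{Tr}[\mathbf{K}_k] \le mR^2$ for any $k \in [1, p]$, thus
the Rademacher complexity can be bounded as follows:

$$\widehat{\mathfrak{R}}_S(H_p) \le \frac{1}{m\rho} \Big( p \big( \sqrt{r m R^2} \big)^r \Big)^{1/r} = p^{1/r} r^{1/2} \sqrt{\frac{R^2/\rho^2}{m}}.$$

For $p > 1$, the function $r \mapsto p^{1/r} r^{1/2}$ reaches its minimum at $r_0 = 2 \log p$. Taking the even integer $r = 2 \lceil \log p \rceil$ gives

$$\widehat{\mathfrak{R}}_S(H_p) \le \sqrt{\frac{2e \lceil \log p \rceil R^2/\rho^2}{m}}.$$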
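The last step of the proof can be checked numerically. The sketch below evaluates the factor $p^{1/r} r^{1/2}$ at the unconstrained minimizer $r_0 = 2 \log p$ and at the even integer $r = 2\lceil \log p \rceil$, and compares the latter with $\sqrt{2e\lceil \log p \rceil}$. The helper name `complexity_factor` and the values of $p$ are illustrative.

```python
import math

def complexity_factor(p, r):
    """The factor p^(1/r) * r^(1/2) multiplying sqrt(R^2 / (rho^2 m))."""
    return p ** (1.0 / r) * math.sqrt(r)

rows = []
for p in (2, 10, 100, 10**4, 10**6):
    r_star = 2.0 * math.log(p)                # unconstrained minimizer
    r_even = 2 * math.ceil(math.log(p))       # even integer allowed by the theorem
    rows.append((
        p,
        complexity_factor(p, r_star),         # equals sqrt(2 e log p)
        complexity_factor(p, r_even),
        math.sqrt(2 * math.e * math.ceil(math.log(p))),
    ))
```

At $r_0$ the factor is $p^{1/(2\log p)}\sqrt{2\log p} = \sqrt{2e\log p}$, and since $p^{1/(2\lceil \log p \rceil)} \le e^{1/2}$ the even-integer choice stays below $\sqrt{2e\lceil \log p \rceil}$.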



It is likely that the constants in the bound of Theorem 1 can be further improved. We used a very rough
upper bound for the multinomial coefficients. A finer bound using Stirling's approximation should provide a
better result. Remarkably, the bound of the theorem has a very mild dependency with respect to $p$.
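To get a feel for how loose or tight the bound is, one can estimate $\widehat{\mathfrak{R}}_S(H_p)$ by Monte Carlo. The sketch below relies on a closed form for the supremum that is derived here rather than stated in the text: for fixed $\boldsymbol{\sigma}$, Cauchy-Schwarz gives $\sup_{\boldsymbol{\alpha}} \boldsymbol{\sigma}^\top \mathbf{K} \boldsymbol{\alpha} = \sqrt{\boldsymbol{\sigma}^\top \mathbf{K} \boldsymbol{\sigma}}/\rho$, and a linear function of $\boldsymbol{\mu}$ over the simplex is maximized at a vertex, so the supremum over $H_p$ is $\max_k \sqrt{\boldsymbol{\sigma}^\top \mathbf{K}_k \boldsymbol{\sigma}}/\rho$. The Gaussian base kernels, sample, and function names are illustrative assumptions.

```python
import math
import numpy as np

def gaussian_gram(X, gamma):
    """Gram matrix of the Gaussian kernel exp(-gamma * ||x - x'||^2)."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def mc_rademacher_convex(grams, rho, n_draws, rng):
    """Monte Carlo estimate of the empirical Rademacher complexity of H_p.

    Uses sup_h sum_i sigma_i h(x_i) = max_k sqrt(sigma^T K_k sigma) / rho.
    """
    m = grams[0].shape[0]
    G = np.stack(grams)                                   # shape (p, m, m)
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, m))
    quad = np.einsum("ni,kij,nj->nk", sigma, G, sigma)    # sigma^T K_k sigma
    sup = np.sqrt(np.maximum(quad, 0.0)).max(axis=1) / rho
    return float(sup.mean() / m)

rng = np.random.default_rng(3)
m, rho, R = 50, 1.0, 1.0           # Gaussian kernels satisfy K(x, x) = 1 = R^2
X = rng.normal(size=(m, 2))
grams = [gaussian_gram(X, g) for g in np.logspace(-2, 2, 8)]   # p = 8
p = len(grams)

estimate = mc_rademacher_convex(grams, rho, n_draws=2000, rng=rng)
bound = math.sqrt(2 * math.e * math.ceil(math.log(p)) * R**2 / rho**2 / m)
```

Comparing `estimate` with `bound` for increasing `p` is one way to see the mild growth in the number of kernels in practice.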

The theorem can be used to derive generalization bounds for learning kernels in classification, regression,
and other tasks. We briefly illustrate its application to binary classification where the labels $y$ are in $\{-1, +1\}$.
Let $R(h)$ denote the generalization error of $h \in H_p$, that is $R(h) = \Pr[y h(x) < 0]$. For a training sample
$S = ((x_1, y_1), \ldots, (x_m, y_m))$ and any $\rho > 0$, let $\widehat{R}_\rho(h)$ denote the fraction of the training points with margin
less than or equal to $\rho$, that is $\widehat{R}_\rho(h) = \frac{1}{m} \sum_{i=1}^{m} 1_{y_i h(x_i) \le \rho}$. Then, the following result holds.

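Corollary 2 For any $\delta > 0$, with probability at least $1 - \delta$, the following bound holds for any $h \in H_p$:

$$R(h) \le \widehat{R}_\rho(h) + \frac{2 \|\boldsymbol{\tau}\|_r}{m\rho} + 2 \sqrt{\frac{\log \frac{2}{\delta}}{2m}}, \tag{7}$$

with $\boldsymbol{\tau} = \big(\sqrt{r\,\mathrm{Tr}[\mathbf{K}_1]}, \ldots, \sqrt{r\,\mathrm{Tr}[\mathbf{K}_p]}\big)^\top$, for any even integer $r > 0$. If additionally, $K_k(x, x) \le R^2$ for all
$x \in X$ and $k \in [1, p]$, then, for $p > 1$,

$$R(h) \le \widehat{R}_\rho(h) + 2 \sqrt{\frac{2e \lceil \log p \rceil R^2/\rho^2}{m}} + 2 \sqrt{\frac{\log \frac{2}{\delta}}{2m}}.$$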



Proof: With our definition of the Rademacher complexity, for any $\delta > 0$, with probability at least $1 - \delta$, the
following bound holds for any $h \in H_p$ [12, 4]:

$$R(h) \le \widehat{R}_\rho(h) + 2 \widehat{\mathfrak{R}}_S(H_p) + 2 \sqrt{\frac{\log \frac{2}{\delta}}{2m}}. \tag{8}$$



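$^1$We use the following rather rough inequality:

$$\binom{2r'}{2t_1, \ldots, 2t_m} = \frac{(2r')!}{(2t_1)! \cdots (2t_m)!} \le \frac{(2r')!}{t_1! \cdots t_m!} = \frac{(2r') \cdots (r'+1) \cdot r'!}{t_1! \cdots t_m!} \le \frac{(2r')^{r'} \cdot r'!}{t_1! \cdots t_m!} = (2r')^{r'} \binom{r'}{t_1, \ldots, t_m}.$$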



Plugging in the bound on the empirical Rademacher complexity given by Theorem 1 yields the statement of
the corollary. ■
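As an illustration of how the second bound of the corollary might be evaluated on data, the sketch below computes the empirical margin loss and adds the complexity and confidence terms. It assumes the confidence term has the form $2\sqrt{\log(2/\delta)/(2m)}$; the function names and the synthetic scores are hypothetical.

```python
import math
import numpy as np

def margin_loss(y, scores, rho):
    """Fraction of training points with margin y_i h(x_i) <= rho."""
    return float(np.mean(np.asarray(y) * np.asarray(scores) <= rho))

def convex_kernel_margin_bound(y, scores, p, R, rho, delta):
    """Margin loss + 2 * complexity term + 2 * confidence term (requires p > 1).

    Assumes the confidence term is sqrt(log(2 / delta) / (2 m)).
    """
    m = len(y)
    complexity = math.sqrt(2 * math.e * math.ceil(math.log(p)) * R**2 / rho**2 / m)
    confidence = math.sqrt(math.log(2 / delta) / (2 * m))
    return margin_loss(y, scores, rho) + 2 * complexity + 2 * confidence

# Hypothetical scores h(x_i) on m = 10,000 labelled points.
rng = np.random.default_rng(4)
y = rng.choice([-1, 1], size=10_000)
scores = y * rng.uniform(0.5, 2.0, size=y.size)    # all margins lie in [0.5, 2.0)
value = convex_kernel_margin_bound(y, scores, p=100, R=1.0, rho=0.5, delta=0.05)
```

With these hypothetical numbers the complexity term dominates the confidence term, and going from $p = 100$ to $p = 10^4$ only changes $\lceil \log p \rceil$ from 5 to 10.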



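The corollary gives a generalization bound for learning kernels with $H_p$ that is in

$$O\Bigg( \sqrt{\frac{(\log p) R^2/\rho^2}{m}} \Bigg). \tag{9}$$

In comparison, the bound for this problem given by Srebro and Ben-David [19] using the pseudo-dimension
has a stronger dependency with respect to $p$ and is more complex:

$$O\Bigg( \sqrt{ 8 \, \frac{2 + p \log \frac{128 e m^3 R^2}{\rho^2 p} + 256 \frac{R^2}{\rho^2} \log \frac{\rho e m}{8R} \log \frac{128 m R^2}{\rho^2}}{m} } \Bigg). \tag{10}$$

This bound is also not informative for $p > m$.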

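4 Rademacher complexity bound for $H'_p$

Theorem 3 For any sample $S$ of size $m$, the Rademacher complexity of the hypothesis set $H'_p$ can be bounded
as follows:

$$\widehat{\mathfrak{R}}_S(H'_p) \le \frac{\|\boldsymbol{\tau}\|_r}{m\rho} \quad \text{with } \boldsymbol{\tau} = \big(\sqrt{r\,\mathrm{Tr}[\mathbf{K}_1]}, \ldots, \sqrt{r\,\mathrm{Tr}[\mathbf{K}_p]}\big)^\top, \tag{11}$$

for any positive even integer $r \le 4$. If additionally, $K_k(x, x) \le R^2$ for all $x \in X$ and $k \in [1, p]$, then, for any
$p \ge 1$,

$$\widehat{\mathfrak{R}}_S(H'_p) \le 2 p^{1/4} \sqrt{\frac{R^2/\rho^2}{m}}.$$

This bound also holds without the condition $\mu_k \ge 0$, $k \in [1, p]$, on the hypothesis set $H'_p$.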

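Proof: We can proceed as in the proof for bounding the Rademacher complexity of $H_p$, except for bounding
the following term:

$$\begin{aligned}
\Big( \sum_{k=1}^{p} \|\mathbf{w}_k\|^q \Big)^{1/q} &= \Big[ \sum_{k=1}^{p} \mu_k^q \big( \boldsymbol{\alpha}^\top \mathbf{K}_k \boldsymbol{\alpha} \big)^{q/2} \Big]^{1/q} \\
&= \Big[ \sum_{k=1}^{p} \mu_k^2 \big( \mu_k^{2(q-2)/q} \, \boldsymbol{\alpha}^\top \mathbf{K}_k \boldsymbol{\alpha} \big)^{q/2} \Big]^{1/q} \\
&\le \sqrt{ \sum_{k=1}^{p} \mu_k^{4(q-1)/q} \, \boldsymbol{\alpha}^\top \mathbf{K}_k \boldsymbol{\alpha} } \quad \text{(convexity)}.
\end{aligned}$$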



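Assume now that $q > 4/3$, which implies $\frac{4(q-1)}{q} > 1$. Then, since $\mu_k \in [0, 1]$, this implies $\mu_k^{4(q-1)/q} \le \mu_k$.
Thus, for any $q > 4/3$, we can write:

$$\Big( \sum_{k=1}^{p} \|\mathbf{w}_k\|^q \Big)^{1/q} \le \sqrt{ \sum_{k=1}^{p} \mu_k \boldsymbol{\alpha}^\top \mathbf{K}_k \boldsymbol{\alpha} } = \sqrt{ \boldsymbol{\alpha}^\top \mathbf{K} \boldsymbol{\alpha} } \le 1/\rho.$$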
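The inequality just obtained can be spot-checked on random inputs. In the sketch below, `a[k]` stands in for $\boldsymbol{\alpha}^\top \mathbf{K}_k \boldsymbol{\alpha} \ge 0$, the weights satisfy $\sum_k \mu_k^2 = 1$, and $q$ is drawn from $[4/3, 2]$, which corresponds to $r \in [2, 4]$ (the upper limit $q \le 2$ is what the convexity step needs). All sampled values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
p = 6
worst_gap = np.inf
for _ in range(1000):
    mu = rng.uniform(size=p)
    mu /= np.linalg.norm(mu)               # mu_k >= 0 and sum_k mu_k^2 = 1
    a = rng.uniform(size=p) * 10.0         # stands in for alpha^T K_k alpha >= 0
    q = rng.uniform(4.0 / 3.0, 2.0)        # the range covered by the proof
    lhs = np.sum(mu**q * a ** (q / 2)) ** (1.0 / q)
    rhs = np.sqrt(np.sum(mu * a))
    worst_gap = min(worst_gap, rhs - lhs)  # should stay non-negative
```

A non-negative `worst_gap` over all draws is consistent with the bound; it is of course not a proof.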



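Taking the limit $q \to 4/3$ shows that the inequality is also verified for $q = 4/3$. Thus, as in the proof for $H_p$,
the Rademacher complexity can be bounded as follows:

$$\widehat{\mathfrak{R}}_S(H'_p) \le \frac{\|\boldsymbol{\tau}\|_r}{m\rho} \quad \text{with } \boldsymbol{\tau} = \big(\sqrt{r\,\mathrm{Tr}[\mathbf{K}_1]}, \ldots, \sqrt{r\,\mathrm{Tr}[\mathbf{K}_p]}\big)^\top, \tag{12}$$

but here $r$ is an even integer such that $1/r = 1 - 1/q \ge 1 - 3/4 = 1/4$, that is $r \le 4$. Assume that
$K_k(x, x) \le R^2$ for all $x \in X$ and $k \in [1, p]$. Then, $\mathrm{Tr}[\mathbf{K}_k] \le mR^2$ for any $k \in [1, p]$, thus, for $r = 4$, the
Rademacher complexity can be bounded as follows:

$$\widehat{\mathfrak{R}}_S(H'_p) \le \frac{1}{m\rho} \Big( p \big( \sqrt{4 m R^2} \big)^4 \Big)^{1/4} = 2 p^{1/4} \sqrt{\frac{R^2/\rho^2}{m}}.$$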
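A Monte Carlo estimate of $\widehat{\mathfrak{R}}_S(H'_p)$ can be compared with the $2p^{1/4}$ bound. The sketch uses a closed form derived here, not given in the text: for fixed $\boldsymbol{\sigma}$, with $v_k = \boldsymbol{\sigma}^\top \mathbf{K}_k \boldsymbol{\sigma} \ge 0$, the supremum of $\sum_k \mu_k v_k$ over $\mu_k \ge 0$, $\sum_k \mu_k^2 = 1$ is $\|\mathbf{v}\|_2$, so the supremum over $H'_p$ is $\sqrt{\|\mathbf{v}\|_2}/\rho$. The normalized linear base kernels and the function name are illustrative assumptions.

```python
import numpy as np

def mc_rademacher_l2(grams, rho, n_draws, rng):
    """Monte Carlo estimate of the empirical Rademacher complexity of H'_p.

    Uses sup_h sum_i sigma_i h(x_i) = sqrt(||v||_2) / rho, v_k = sigma^T K_k sigma.
    """
    m = grams[0].shape[0]
    G = np.stack(grams)                                   # shape (p, m, m)
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, m))
    quad = np.einsum("ni,kij,nj->nk", sigma, G, sigma)    # v_k for each draw
    sup = np.sqrt(np.linalg.norm(quad, axis=1)) / rho
    return float(sup.mean() / m)

rng = np.random.default_rng(6)
m, p, d, rho = 40, 16, 3, 1.0
# Linear kernels on p independent feature blocks, rows rescaled to unit norm
# so that K_k(x, x) = 1 = R^2.
grams = []
for _ in range(p):
    Z = rng.normal(size=(m, d))
    Z /= np.linalg.norm(Z, axis=1, keepdims=True)
    grams.append(Z @ Z.T)

estimate = mc_rademacher_l2(grams, rho, n_draws=2000, rng=rng)
bound = 2.0 * p ** 0.25 * np.sqrt(1.0 / (rho**2 * m))    # 2 p^(1/4) sqrt(R^2 / (rho^2 m)), R = 1
```

Repeating the experiment for several values of `p` gives a rough empirical view of the $p^{1/4}$ growth.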



Thus, in this case, the bound has a mild dependence ($\sqrt[4]{p}$) on the number of kernels $p$. Proceeding as in
the $L_1$ case leads to the following margin bound in binary classification.

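Corollary 4 For any $\delta > 0$, with probability at least $1 - \delta$, the following bound holds for any $h \in H'_p$:

$$R(h) \le \widehat{R}_\rho(h) + \frac{2 \|\boldsymbol{\tau}\|_r}{m\rho} + 2 \sqrt{\frac{\log \frac{2}{\delta}}{2m}}, \tag{13}$$

with $\boldsymbol{\tau} = \big(\sqrt{r\,\mathrm{Tr}[\mathbf{K}_1]}, \ldots, \sqrt{r\,\mathrm{Tr}[\mathbf{K}_p]}\big)^\top$, for any even integer $r \in \{2, 4\}$. If additionally, $K_k(x, x) \le R^2$
for all $x \in X$ and $k \in [1, p]$, then, for any $p \ge 1$,

$$R(h) \le \widehat{R}_\rho(h) + 4 p^{1/4} \sqrt{\frac{R^2/\rho^2}{m}} + 2 \sqrt{\frac{\log \frac{2}{\delta}}{2m}}.$$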

5 Conclusion 

We presented several new generalization bounds for the problem of learning kernels with non-negative
combinations of base kernels. Our bounds are simpler and significantly improve over previous bounds. Their
very mild dependency on the number of kernels seems to suggest the use of a large number of kernels for this
problem. Our experiments with this problem in regression using a large number of kernels seem to
corroborate this idea [8]. Much needs to be done however to reconcile these theoretical findings with the somewhat
disappointing performance observed in practice in most learning kernel experiments [7].
A Lemma 5

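The following lemma is a straightforward version of Hölder's inequality.

Lemma 5 Let $q, r > 1$ with $1/q + 1/r = 1$. Then, the following result similar to Hölder's inequality holds:

$$|\mathbf{w} \cdot \boldsymbol{\Phi}(x)| \le \Big( \sum_{k=1}^{p} \|\mathbf{w}_k\|^q \Big)^{1/q} \Big( \sum_{k=1}^{p} \|\Phi_k(x)\|^r \Big)^{1/r}.$$

Proof: Let $\Psi_q(\mathbf{w}) = \big( \sum_{k=1}^{p} \|\mathbf{w}_k\|^q \big)^{1/q}$ and $\Psi_r(\boldsymbol{\Phi}(x)) = \big( \sum_{k=1}^{p} \|\Phi_k(x)\|^r \big)^{1/r}$, then

$$\begin{aligned}
\frac{|\mathbf{w} \cdot \boldsymbol{\Phi}(x)|}{\Psi_q(\mathbf{w}) \Psi_r(\boldsymbol{\Phi}(x))} &= \Big| \sum_{k=1}^{p} \frac{\mathbf{w}_k}{\Psi_q(\mathbf{w})} \cdot \frac{\Phi_k(x)}{\Psi_r(\boldsymbol{\Phi}(x))} \Big| \\
&\le \sum_{k=1}^{p} \frac{\|\mathbf{w}_k\|}{\Psi_q(\mathbf{w})} \, \frac{\|\Phi_k(x)\|}{\Psi_r(\boldsymbol{\Phi}(x))} \quad \text{(Cauchy-Schwarz)} \\
&\le \sum_{k=1}^{p} \Big[ \frac{1}{q} \frac{\|\mathbf{w}_k\|^q}{\Psi_q(\mathbf{w})^q} + \frac{1}{r} \frac{\|\Phi_k(x)\|^r}{\Psi_r(\boldsymbol{\Phi}(x))^r} \Big] \quad \text{(Young's inequality)} \\
&= \frac{1}{q} + \frac{1}{r} = 1.
\end{aligned} \tag{14}$$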
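The block form of Hölder's inequality in Lemma 5 can be spot-checked numerically with finite-dimensional blocks. In the sketch below, row `k` of `w` and `phi` plays the role of $\mathbf{w}_k$ and $\Phi_k(x)$; the dimensions and the sampling range for $q$ are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(7)
p, d = 5, 4
worst_gap = np.inf
for _ in range(1000):
    w = rng.normal(size=(p, d))            # row k plays the role of w_k
    phi = rng.normal(size=(p, d))          # row k plays the role of Phi_k(x)
    q = rng.uniform(1.05, 5.0)
    r = q / (q - 1.0)                      # conjugate exponent: 1/q + 1/r = 1
    lhs = abs(np.sum(w * phi))             # |w . Phi(x)|
    rhs = (np.sum(np.linalg.norm(w, axis=1) ** q) ** (1 / q)
           * np.sum(np.linalg.norm(phi, axis=1) ** r) ** (1 / r))
    worst_gap = min(worst_gap, rhs - lhs)  # should stay non-negative
```

The check combines the two steps of the proof: Cauchy-Schwarz within each block, then the scalar Hölder inequality across the $p$ block norms.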



